Single-cell transcriptome sequencing (scRNA-seq) experiments allow us to discover new cell types and help us understand how they arise in development. The Monocle 3 package provides a toolkit for analyzing single-cell gene expression experiments. In our scDown package, we provide a pseudotime trajectory analysis workflow with a single function that carries out the pre-processing, trajectory analysis, and downstream differential expression analysis. In addition to that, we also generates various high quality plots that are very crucial for the trajectory analysis.
We need two required information/variables to run the monocle3
function. Users need to load a Seurat object (seurat_obj)
containing single-cell RNA sequencing data. The species is
set to either “human” or “mouse,” as monocle3 supports only these two
species. The metadata column (annotation_column) is defined
to specify the cell type or other annotation labels, ensuring the
analysis runs with the correct input data (by default it takes
annotations from Idents). The users can also specify the working
directory for the output files and plots (By default it is current
directory). There are also a lot of other optional variables that the
users can pass which we are going to talk about in the different
workflows below.
# Set the working directory
output_dir <- "."
# Seurat object
seurat_obj <- readRDS("../scDown/inst/extdata/10X43_1_spliced_unspliced.rds")
# Define the species: either "mouse" or "human", since potency method only support these two species for identifying the root node
species <- "mouse"
# PCA Dimensions to preprocess the data
nDim <- 30
# Metadata cell type column
annotation_column <- "clusters"Users can load their scRNASeq data which they would like to run the trajectory analysis. To run the trajectory analysis, the users have to provide the scRNASeq data and species. By default, we would run the trajectory analysis on the whole dataset, identify the root nodes using potency method, and run trajectory analysis using monocle3.
library(scDown)
run_monocle3(seurat_obj = seurat_obj,
species = species,
nDim = nDim,
annotation_column = annotation_column,
graph_test=TRUE,
output_dir = output_dir)Running Time: ~ 5 minutes
Operating System: Rocky Linux 8.9 (Green Obsidian)
CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory: 10 GB
The first step of construction of the single cell trajectories is the
pre-processing step of the data. In this step, the data is normalized
and batch corrected (if the user provides the batch metadata in the
batch_metadata variable. By Default, it is NULL). After
that, the dimensionality reduction is carried out and UMAP coordinates
are calculated. The function can also use the UMAP coordinates from the
Seurat object (by setting transferUMAP=TRUE). The UMAP plot
shows the different cell types in the data after transferring the UMAP
coordinates.
Next, the cells are grouped into broad cell clusters represented in the data. Monocle does not assume that all cells in the dataset descend from a common transcriptional “ancestor”. In many experiments, there might in fact be multiple distinct trajectories. The monocle functions uses a technique called community detection to group cells. This approach was introduced by Levine et al as part of the phenoGraph algorithm. The UMAP plot shows the different partitions in the data.
Next, the trajectories in the single-cell data is learned by fitting a principal graph within each partition. These trajectories are further used to fit or order the cells in the pseudotime. Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation. In order to do that, monocle needs a root node which acts as a starting point for the differentiation or the biological process. The UMAP plot shows the different nodes along the trajectories.
In our workflow, we have three methods to identify the root node in
the single cell data. By default
(rootNode_method = "potency"), we use the methods developed
by Teschendorff et al. which uses the their gene expression profiles and
a protein-protein interaction network to automatically identify the root
node based on the potency. Below we show the UMAP plot that ordered the
cells based on the root node selected using potency.
The second method is to automatically identify the root node which is
most heavily occupied by early time point cells
((rootNode_method = "rootNodes")). For that, the users need
to provide the early timepoint category (timepoint) and the
meta data column (timePoint_metadata) in the seurat object.
The third method is that users can provide the root name by selecting
and passing the principal point (rootNode) on the
trajectory in the trajectory UMAP plot.
The pipeline also identifies the genes that vary between groups of
cells across the trajectories using the spatial autocorrelation analysis
called Moran’s I developed by Cao & Spielmann et al. This method
identifies genes that vary between groups of cells in UMAP space and is
effective in finding genes that vary in single-cell RNA-seq datasets.
All the significant trajectory-variable genes were exported in a csv
file (monocleDEG_significant_by_trajectory.csv).The plot
below shows the expression of the top trajectory-variable genes in a
feature plot.
We also visualized the expression levels of top trajectory-variable genes across the pseudotime and colored as cell types.
The set of the trajectory-variable genes is further grouped into
modules using Louvain community analysis. The clustering information is
exported in a csv file
(monocleDEG_by_trajectory_moduleInformation.csv). The
module heatmap below shows the expression levels of each module across
the cell types.
The function will also generate plots that shows the distribution of the cells across the pseudotime. Below the density plot shows the distribution of cells across the pseudotime for the complete dataset separated by the timepoints.
Below the histogram shows the distribution of cells across the pseudotime for the complete dataset per cell type separated by timepoints.
The workflow also generates the distribution of cells per condition too in the output folder.
In this part, we will show how the pipeline can be used to analyze the trajectories between the different conditions and compare the expression of the significant trajectory variable genes across the conditions using regression analysis.
The users need to provide the conditions and condition metadata in
the run_monocle3 function to carry out the trajectory
analysis across different conditions. This workflow also need the
deg_method (By default,
deg_method = "quasipoisson") to carry out the regression
analysis to identify those trajectory variable genes that differ between
the conditions. In our test data, we selected the age of the mouse in
days (12 and 35 days) as the conditions.
conditions <- c("12","35") # a string specifying a column in seurat_object@meta.data that contains the condition names, or NULL when no condition is filled.
condition_metadata <- "age.days."
deg_method = "quasipoisson"
library(scDown)
run_monocle3(seurat_obj=seurat_obj,
species=species,
nDim = nDim,
annotation_column = annotation_column,
graph_test=TRUE,
conditions = conditions,
group_column=condition_metadata,
deg_method = "quasipoisson",
output_dir = output_dir)Running Time: ~ 8 minutes
Operating System: Rocky Linux 8.9 (Green Obsidian)
CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory: 10 GB
The UMAP plot shows the different cell types in the data at timepoint 12.
Below we show the UMAP plot that ordered the cells based on the root node selected using potency at timepoint 12.
When conditions are supplied, the function will also generate plots that shows the distribution of the cells across the pseudotime separated by conditions. Below the density plot shows the distribution of cells across the pseudotime for the complete dataset separated by the timepoints.
Below the histogram shows the distribution of cells across the pseudotime for the complete dataset per cell type separated by timepoints.
The workflow also generates the distribution of cells per condition too in the output folder.
When conditions are supplied, the pipeline also identifies the
significant differential trajectory-variable genes between conditions by
fitting a regression model (using the conditions metadata) to each
identified trajectory-variable genes. All the significant differential
trajectory-variable genes were exported in a csv file
(monocleDEG_significant_by_age.days.+trajectory_12_35.csv).The
plot below shows the expression of the top trajectory-variable genes in
a feature plot.
We also visualized the expression levels of top trajectory-variable genes across the pseudotime and colored as cell types.
The pipeline can also be used to identify the significant
differential expression genes between conditions using a generalized
linear model using “quasipoisson” distribution by default. The users can
use four different models (“quasipoisson”, “negbinomial”,“poisson” or
“binomial”) by supplying it in deg_model variable. This is
an optional feature which can be turned on by supplying the metadata in
the metadata_deg_model.
conditions <- c("12","35") # a string specifying a column in seurat_object@meta.data that contains the condition names, or NULL when no condition is filled.
condition_metadata <- "age.days."
deg_method = "quasipoisson"
library(scDown)
run_monocle3(seurat_obj=seurat_obj,
species=species,
nDim = nDim,
annotation_column = annotation_column,
deg_method = "quasipoisson",
metadata_deg_model=condition_metadata,
output_dir = output_dir)Running Time: ~ 8 minutes
All the significant differential expressed genes were exported in a
csv file (monocleDEG_significant_by_age.days..csv).The plot
below shows the expression of the top differential expressed genes in a
feature plot.
We also visualized the expression levels of top differential
expressed genes in violin plot.
In this workflow, the users can also provide list of combinations of cell types between which they would like to analyze using the trajectory analysis workflow.
library(scDown)
cell_types_monocle<-list(c("Granule immature","Granule mature"))
library(scDown)
run_monocle3(seurat_obj=seurat_obj,
species=species,
nDim = nDim,
annotation_column = annotation_column,
conditions = conditions,
group_column=condition_metadata,
deg_method = "quasipoisson",
celltype_groups = cell_types_monocle,
graph_test=TRUE,
output_dir = output_dir)Running Time: ~10 minutes
Operating System: Rocky Linux 8.9 (Green Obsidian)
CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory: 10 GB
As mentioned previously, the trajectory in the cell type groups is learned by fitting a principal graph within each partition which are further used to fit or order the cells in the pseudotime. The UMAP plot shows the different nodes along the trajectories for the cell type groups.
In our workflow, we used the potency method to find the root node
(rootNode_method = "potency"). Below we show the UMAP plot
that ordered the cells based on the root node selected using potency. As
expected, the root node is found in the Granule immature cells.
After that, the pipeline identifies the genes that vary between cell
types across the trajectories using the spatial autocorrelation
analysis. All the significant trajectory-variable genes were exported in
a csv file
(monocleDEG_significant_by_trajectory_Granule immature_Granule mature.csv).The
plot below shows the expression of the top trajectory-variable genes in
a feature plot.
We also visualized the expression levels of top trajectory-variable genes across the pseudotime and colored as cell types.
The set of the trajectory-variable genes is further grouped into
modules using Louvain community analysis. The clustering information is
exported in a csv file
(monocleDEG_by_trajectory_moduleInformation_Granule immature_Granule mature.csv).
The module heatmap below shows the expression levels of each module
across the two cell types.
The function generate plots that shows the distribution of the cells across the pseudotime. Below the density plot shows the distribution of cells across the pseudotime for the cell types separated by the timepoints.
Below the histogram shows the distribution of cells across the pseudotime for the cell types separated by timepoints.
All outputs of the pipeline will be in the monocle folder, which is divided into:
The csv subfolder contains:
The images subfolder contains :
cellDistribution: cell and cell type distribution density plots plus histogram per (subsetted) object and condition.
DEG:
pseudotime: umap by cell types, partitions, learned trajectory, and pseudotime.
The rds subfolder contains:
This document outlines a pipeline for analyzing trajectory analysis using monocle3. The pipeline begins by defining key variables, such as the working directory, Seurat object, species, and metadata for cell type annotations. It runs monocle3 on the entire dataset to explore global trajectory analysis and generates visualizations like UMAPs, density, histogram, heatmaps, and violin plots, which highlight trajectories, expression of genes, distribution of cells across pseudotime, and expression of differential expressed genes.
Additionally, the pipeline supports focused analysis by selecting specific cell types or comparing user-defined groups (e.g., age groups 35 and 12). Pairwise comparisons are made to examine differences in trahetcory differences. This pipeline provides a comprehensive tool for exploring and comparing trajectory analysis.
The analysis was performed on a machine with the following specifications: